Abstract

We construct data dependent upper bounds on the risk in function learning problems.
The bounds are based on the local norms of the Rademacher process indexed
by the underlying function class, and they do not require prior knowledge about the
distribution of training examples or any specific properties of the function class. Using
Talagrand's type concentration inequalities for empirical and Rademacher processes,
we show that the bounds hold with high probability: the probability that they fail
decreases exponentially fast as the sample size grows. In typical situations that are
frequently encountered in the theory of function learning, the bounds give a nearly
optimal rate of convergence of the risk to zero.

1 Local Rademacher norms and bounds on the risk: main results

Let $(S, \mathcal{A})$ be a measurable space and let $\mathcal{F}$ be a class of $\mathcal{A}$-measurable functions
from $S$ into $[0, 1]$. Denote by $\mathcal{P}(S)$ the set of all probability measures on $(S, \mathcal{A})$.
Let $f_0 \in \mathcal{F}$ be an unknown target function. Given a probability measure $P \in \mathcal{P}(S)$
(also unknown), let $(X_1, \dots, X_n)$ be an i.i.d. sample in $(S, \mathcal{A})$ with common distribution $P$
(defined on a probability space $(\Omega, \Sigma, \mathbb{P})$). In computer learning theory, the problem of
estimating $f_0$, based on the labeled sample $(X_1, Y_1), \dots, (X_n, Y_n)$, where $Y_j := f_0(X_j)$,
$j = 1, \dots, n$, is referred to as the function learning problem. The so called concept learning
is a special case of function learning. In this case, $\mathcal{F} := \{I_C : C \in \mathcal{C}\}$, where
$\mathcal{C} \subset \mathcal{A}$ is called a class of concepts (see Vapnik (1998), Vidyasagar (1996), Devroye, Gyorfi
and Lugosi (1996) for an account of statistical learning theory). The goal of function learning
is to find an estimate $\hat f_n := \hat f_n((X_1, Y_1), \dots, (X_n, Y_n))$ of the unknown target function
such that the $L_1$-distance between $\hat f_n$ and $f_0$ becomes small with high probability as soon
as the sample size becomes large enough. The $L_1$-distance $P|\hat f_n - f_0|$ is often called the risk
(also the generalization, or prediction, error) of the estimate $\hat f_n$. A class $\mathcal{F}$ is called
probably approximately correctly (PAC) learnable iff for all $\varepsilon > 0$

$$\pi_n(\mathcal{F}; \varepsilon) := \sup_{P \in \mathcal{P}(S)} \sup_{f_0 \in \mathcal{F}} \mathbb{P}\{P|\hat f_n - f_0| \ge \varepsilon\} \to 0 \quad \text{as } n \to \infty.$$

Bounds on the probability $\pi_n(\mathcal{F}; \varepsilon)$ are of importance in the theory. Such bounds allow
one to determine the quantity

$$N_{\mathcal{F}}(\varepsilon; \delta) := \inf\{n : \pi_n(\mathcal{F}; \varepsilon) \le \delta\},$$

which is called the sample complexity of learning. Unfortunately, a bound that is uniform
in the class of all distributions $\mathcal{P}(S)$ is not necessarily tight for a particular distribution $P$,
and often such a bound does not provide a reasonable estimate of the minimal sample size
needed to achieve a certain accuracy of learning in the case of a particular $P$.

A natural approach to the function learning problem (in the case when $f_0 \in \mathcal{F}$) is to
find $\hat f_n \in \mathcal{F}$ such that $\hat f_n(X_j) = f_0(X_j) = Y_j$ for all $j = 1, \dots, n$. In learning theory,
such an estimate $\hat f_n$ is called consistent (this notion should not be confused with consistency
in the statistical sense).

We construct below a data dependent bound on the risk of a consistent estimate $\hat f_n$.
More precisely, given $\delta > 0$, we define a quantity

$$\hat\beta_n(\mathcal{F}; \delta) = \hat\beta_n(\mathcal{F}; \delta; (X_1, Y_1), \dots, (X_n, Y_n))$$

such that for any consistent estimate $\hat f_n$

$$\sup_{P \in \mathcal{P}(S)} \sup_{f_0 \in \mathcal{F}} \mathbb{P}\{P|\hat f_n - f_0| > \hat\beta_n(\mathcal{F}; \delta)\} \le \delta. \tag{1.1}$$

We consider a couple of important examples in which the bound we suggest gives a nearly
optimal rate of convergence of the risk to 0 as the sample size tends to infinity.

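Given a class $\mathcal{G}$ of $\mathcal{A}$-measurable functions from $S$ into $[0, 1]$ with $0 \in \mathcal{G}$, let
$\mathcal{G}_n$ denote the restriction of the class $\mathcal{G}$ to the sample $(X_1, \dots, X_n)$. Consider a quantity

$$\hat\gamma_n(\mathcal{G}; \delta) = \hat\gamma_n(\mathcal{G}_n; \delta; X_1, \dots, X_n)$$

such that the bound

$$\sup_{P \in \mathcal{P}(S)} \mathbb{P}\{P \hat g_n > \hat\gamma_n(\mathcal{G}; \delta)\} \le \delta$$

holds for any class $\mathcal{G}$ and for any function $\hat g_n \in \mathcal{G}$ satisfying the conditions
$\hat g_n(X_j) = 0$ for all $j = 1, \dots, n$.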
Define

$$\mathcal{F}(f_0) := \{|f - f_0| : f \in \mathcal{F}\}$$

(note that the values of the functions from this class are known on the sample $(X_1, \dots, X_n)$)
and

$$\mathcal{F}_n(f_0) := \{(|f - f_0|(X_j) : 1 \le j \le n) : f \in \mathcal{F}\} = \{(|f(X_j) - Y_j| : 1 \le j \le n) : f \in \mathcal{F}\}.$$

If $\hat f_n$ is a consistent estimate, then the function $\hat g_n := |\hat f_n - f_0| \in \mathcal{F}(f_0)$ satisfies the
condition $\hat g_n(X_j) = 0$ for all $j = 1, \dots, n$. Then, clearly, for any consistent estimate $\hat f_n$,

$$\sup_{P \in \mathcal{P}(S)} \sup_{f_0 \in \mathcal{F}} \mathbb{P}\{P|\hat f_n - f_0| > \hat\gamma_n(\mathcal{F}(f_0); \delta)\} \le \delta.$$

Therefore, if one defines (for $Y_j = f_0(X_j)$)

$$\hat\beta_n(\mathcal{F}; \delta; (X_1, Y_1), \dots, (X_n, Y_n)) := \hat\gamma_n(\mathcal{F}_n(f_0); \delta; X_1, \dots, X_n),$$

then (1.1) holds.

These considerations show that the problem can always be reduced to the case $f_0 = 0$.
To simplify the notation, we make this assumption in what follows.

We also assume for simplicity that $\mathcal{F}$ is a countable class of functions. This condition can
be easily replaced by standard measurability assumptions known in the theory of empirical
processes (see, e.g., [3] or [13]; we do not make the countability assumption in some of the
examples below). Estimates $\hat f_n$ are supposed to be $\Sigma \times \mathcal{A}$-measurable. We denote by $P_n$ the
empirical measure based on the sample $(X_1, \dots, X_n)$:

$$P_n := n^{-1} \sum_{i=1}^{n} \delta_{X_i},$$

where $\delta_x$ is the probability measure concentrated at the point $x \in S$. We also use the notation
$\|\cdot\|_{\mathcal{F}}$ for the sup-norm of functions from the class $\mathcal{F}$ into $\mathbb{R}$:

$$\|Y\|_{\mathcal{F}} := \sup_{f \in \mathcal{F}} |Y(f)|.$$

Our approach is based on the following simple idea. Denote $B(r) := \{f : P|f| \le r\}$ and
set $r_0^n = 1$. It is clear that for any consistent estimate $\hat f_n$ we have $P_n \hat f_n = 0$ and, hence,

$$P \hat f_n \le P_n \hat f_n + \|P_n - P\|_{\mathcal{F}} = \|P_n - P\|_{\mathcal{F}} = \|P_n - P\|_{\mathcal{F} \cap B(r_0^n)} =: r_1^n.$$

Therefore, $\hat f_n \in \mathcal{F} \cap B(r_1^n)$. It means that actually

$$P \hat f_n \le P_n \hat f_n + \|P_n - P\|_{\mathcal{F} \cap B(r_1^n)} = \|P_n - P\|_{\mathcal{F} \cap B(r_1^n)}.$$

We can repeat this recursive procedure infinitely many times. Namely, if
$r_{k+1}^n := \|P_n - P\|_{\mathcal{F} \cap B(r_k^n)}$, then, by induction, $P \hat f_n \le r_k^n$ for any natural $k$.
It is also clear that the sequence $\{r_k^n\}$ is nonincreasing. Indeed, by a simple induction argument,
$r_k^n \le r_{k-1}^n$ implies that

$$r_{k+1}^n = \|P_n - P\|_{\mathcal{F} \cap B(r_k^n)} \le \|P_n - P\|_{\mathcal{F} \cap B(r_{k-1}^n)} = r_k^n.$$

Thus, the following proposition holds.

Proposition 1 The sequence $\{r_k^n\}_{k \ge 1}$ is nonincreasing and for any consistent estimate $\hat f_n$

$$P \hat f_n \le \inf_{k \ge 1} r_k^n.$$

The sequence $\{r_k^n\}_{k \ge 1}$ depends not only on the data; it also depends explicitly on the
unknown distribution $P$, so it cannot be used for the purposes of bounding the risk. However,
there is a simple bootstrap type approach that allows one to get around this difficulty.
The Rademacher process indexed by the function class $\mathcal{F}$ is defined as

$$R_n(f) := n^{-1} \sum_{i=1}^{n} \varepsilon_i f(X_i), \quad f \in \mathcal{F},$$

where $\{\varepsilon_i\}$ is a Rademacher sequence (an i.i.d. sequence of random variables taking the
values $+1$ and $-1$ with probability $1/2$ each) independent of $\{X_i\}$. It has been used for
a long time to obtain bounds on the sup-norm of the empirical process indexed by
functions (in the so called symmetrization inequalities, see [13]). Recently, Koltchinskii jE]
(see also jjj) suggested using $\|R_n\|_{\mathcal{F}}$ as a data-based measure of the accuracy of empirical
approximation $\|P_n - P\|_{\mathcal{F}}$ in learning problems and developed a version of structural risk
minimization in which the norms of the Rademacher process play the role of data-dependent
penalties. Lozano [S] compared this method of penalization with the method based on
VC-dimensions and the cross-validation method and found that in the so called problem
of the "intervals model selection" the Rademacher penalization performs better than other
methods. Hush and Scovel (1999) used Rademacher norms to obtain posterior performance
bounds for machine learning. However, the "global" norm of the Rademacher process does not
allow one to recover the rate of convergence of the risk to 0 in the case when $f_0 \in \mathcal{F}$ (the
so called zero error case). To address this problem, we define below a sequence of localized
norms of the Rademacher process that majorizes the sequence $\{r_k^n\}$ defined above.
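For a finite class, the norm $\|R_n\|_{\mathcal{F}}$ is a plain maximum and can be computed directly from the sample and one draw of the signs. The sketch below is illustrative only; it assumes the normalization $n^{-1}$, a class of threshold indicators on $[0, 1]$, and a hypothetical function name `rademacher_norm`.

```python
import random

def rademacher_norm(values, signs):
    """sup over f of |n^{-1} sum_i eps_i f(X_i)| for a finite class.

    values[f][i] holds f(X_i); signs[i] is the Rademacher variable eps_i.
    """
    n = len(signs)
    return max(abs(sum(e * v for e, v in zip(signs, row))) / n for row in values)

random.seed(1)
xs = [random.random() for _ in range(50)]
thresholds = [k / 10 for k in range(11)]
# Row t holds the values of the indicator of [0, t] on the sample.
values = [[1.0 if x <= t else 0.0 for x in xs] for t in thresholds]
signs = [random.choice((-1, 1)) for _ in xs]
norm = rademacher_norm(values, signs)
```

The quantity depends only on the observed points and on simulated signs, so no knowledge of $P$ is required.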
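Given $\varepsilon > 0$, let $\hat\varphi$ be a (random) function defined by

$$\hat\varphi(r) := K_1 \|R_n\|_{\mathcal{F} \cap B^e(2r)} + K_2 \sqrt{r\varepsilon} + K_3 \varepsilon,$$

where $B^e(r) := \{f \in \mathcal{F} : P_n f \le r\}$ and $K_1, K_2, K_3 > 0$ are numerical constants.
We introduce the following data-dependent sequence

$$\{\hat r_k^n\}_{k \ge 0} = \{\hat r_k^n(X_1, \dots, X_n; \varepsilon_1, \dots, \varepsilon_n)\}_{k \ge 0},$$

$$\hat r_0^n = 1, \quad \hat r_{k+1}^n = \hat\varphi(\hat r_k^n) \wedge 1, \quad k = 0, 1, 2, \dots \tag{1.2}$$

Since the function $\hat\varphi$ is nondecreasing, a simple induction shows that the sequence $\{\hat r_k^n\}$ is
nonincreasing.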
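The iteration (1.2) can be sketched for a finite class as follows. This is an illustration under stated assumptions, not the authors' implementation: the empirical ball is taken to be $\{f : P_n f \le 2r\}$, and the constants `k1`, `k2`, `k3` are placeholders supplied by the caller rather than the values required by Theorem 1.

```python
import math

def local_rademacher_radii(values, signs, eps, k1, k2, k3, num_iter):
    """Iterate r_{k+1} = min(phi_hat(r_k), 1) from r_0 = 1 for a finite class.

    values[f][i] holds f(X_i) in [0, 1]; signs[i] is the Rademacher sign eps_i.
    """
    n = len(signs)

    def phi_hat(r):
        # Functions in the empirical ball: empirical mean at most 2r.
        ball = [row for row in values if sum(row) / n <= 2 * r]
        rad = max(
            (abs(sum(e * v for e, v in zip(signs, row))) / n for row in ball),
            default=0.0,
        )
        return k1 * rad + k2 * math.sqrt(r * eps) + k3 * eps

    radii = [1.0]
    for _ in range(num_iter):
        radii.append(min(phi_hat(radii[-1]), 1.0))
    return radii
```

Since `phi_hat` is nondecreasing in `r` and the first step cannot exceed 1, the returned list is nonincreasing, matching the remark above.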

Theorem 1 There is a choice of numerical constants $K_1, K_2, K_3 > 0$ such that for all
$P \in \mathcal{P}(S)$, for all $N \ge 1$ and for any consistent estimate $\hat f_n$

$$\mathbb{P}\{P \hat f_n > \hat r_N^n\} \le 2N e^{-n\varepsilon/2}.$$

Thus, if one chooses $N \ge 1$ and, for a given $\delta > 0$, $\varepsilon \ge 2\log(2N/\delta)/n$ (so that
$2N e^{-n\varepsilon/2} \le \delta$), then one can define $\hat\beta_n(\mathcal{F}; \delta) := \hat r_N^n$ to get the bound (1.1).
The question to be answered is how large the number of iterations $N$ should be to achieve a
reasonably good upper bound on the risk in such a way (if it is possible at all). Surprisingly,
under rather general conditions the upper bound becomes sharp after very few iterations
(roughly, the number of iterations $N$ is of the order $\log_2 \log_2(1/\varepsilon)$).
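The interplay between $N$, $\varepsilon$ and $\delta$ is simple arithmetic. The helpers below are a hedged sketch: the names are hypothetical, the bracket $[\cdot]$ in $N = [\log_2 \log_2 \varepsilon^{-1}] + 1$ is read as the integer part, and $\varepsilon < 1/2$ is assumed so that the double logarithm is defined.

```python
import math

def num_iterations(eps):
    """N = [log2 log2 (1/eps)] + 1, reading [.] as the integer part (eps < 1/2)."""
    return math.floor(math.log2(math.log2(1.0 / eps))) + 1

def failure_bound(n, eps, num_iter):
    """The bound 2 N exp(-n eps / 2) on the failure probability."""
    return 2 * num_iter * math.exp(-n * eps / 2)

def smallest_eps(n, delta, num_iter):
    """Smallest eps with 2 N exp(-n eps / 2) <= delta for a fixed N."""
    return 2 * math.log(2 * num_iter / delta) / n
```

For example, `num_iterations(0.01)` evaluates to 3, so even a small $\varepsilon$ leads to very few iterations.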

In what follows, given a (pseudo)metric space $(M; d)$, we denote by $N_d(M; \varepsilon)$ the minimal
number of balls of radius $\varepsilon$ covering $M$, and $H_d(M; \varepsilon) := \log N_d(M; \varepsilon)$. Also, for a
probability measure $Q$ on $(S, \mathcal{A})$, $d_{Q,2}$ denotes the metric of the space $L_2(S; dQ)$.

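Given a class of functions $\mathcal{F}$, assume that

$$\mathbb{E}_\varepsilon \Big\| n^{-1/2} \sum_{i=1}^{n} \varepsilon_i \delta_{X_i} \Big\|_{\mathcal{F} \cap B^e(r)} \le \hat\psi_n(\sqrt{r})$$

for some concave nondecreasing (random) function $\hat\psi_n$, where $\mathbb{E}_\varepsilon$ denotes the expectation
with respect to the Rademacher sequence only. Usually the role of $\hat\psi_n$ will be played
by the random entropy integral

$$\hat\psi_n(r) = K \int_0^r H^{1/2}_{d_{P_n,2}}(\mathcal{F}, u)\, du$$

or by some further upper bound on the random entropy integral. Let us denote by
$\hat\delta_n := \hat\delta_n(X_1, \dots, X_n)$ the solution of the equation

$$\hat\delta_n = n^{-1/2} \hat\psi_n\big(\sqrt{\hat\delta_n}\big).$$

The following theorem gives an upper bound on the quantity $\hat r_N^n$.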
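When $\hat\psi_n$ is available in closed form, the fixed point $\hat\delta_n$ can be found numerically. The following bisection is a sketch under the assumption that the function passed as `psi` is concave, nondecreasing and vanishes at 0, so that $n^{-1/2}\psi(\sqrt{\delta}) - \delta$ changes sign once on $(0, 1]$; the function name and the bracketing interval are hypothetical.

```python
import math

def solve_delta(psi, n, lo=1e-12, hi=1.0, iters=200):
    """Bisection for the positive solution of delta = n^{-1/2} psi(sqrt(delta))."""
    def gap(d):
        return psi(math.sqrt(d)) / math.sqrt(n) - d

    if gap(hi) >= 0:
        return hi  # the solution is not below hi; cap it there
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid   # still left of the fixed point
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Linear psi(r) = A r gives the closed form delta = A^2 / n.
delta = solve_delta(lambda r: 3.0 * r, 10_000)
```

In the linear case the numerical solution can be compared with the closed form $A^2/n$, which is the situation of Example 1 below.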

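Theorem 2 If the number of iterations is equal to $N = [\log_2 \log_2 \varepsilon^{-1}] + 1$, then for some
numerical constant $c > 0$ and for all $P \in \mathcal{P}(S)$

$$\mathbb{P}\big\{\hat r_N^n > c(\hat\delta_n \vee \varepsilon)\big\} \le \big([\log_2 \log_2 \varepsilon^{-1}] + 1\big) e^{-n\varepsilon/2}.$$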

Example 1. Learning a concept from a VC-class. Consider the case of concept
learning, when $\mathcal{F} := \{I_C : C \in \mathcal{C}\}$. Given a sample $(X_1, \dots, X_n)$ with unknown common
distribution $P \in \mathcal{P}(S)$, we observe the labels $\{Y_j := I_{C_0}(X_j) : 1 \le j \le n\}$ for an unknown
target concept $C_0 \in \mathcal{C}$. An estimate $\hat C_n = \hat C_n((X_1, Y_1), \dots, (X_n, Y_n))$ of the target concept
$C_0$ is called consistent iff $I_{\hat C_n}(X_j) = Y_j$ for all $j = 1, \dots, n$. Let

$$\Delta^{\mathcal{C}}(X_1, \dots, X_n) := \mathrm{card}\big(\{C \cap \{X_1, \dots, X_n\} : C \in \mathcal{C}\}\big).$$

Then

$$\hat\psi_n(r) := K \big(\log \Delta^{\mathcal{C}}(X_1, \dots, X_n)\big)^{1/2} r$$

is an upper bound on the random entropy integral, which yields the value of $\hat\delta_n$

$$\hat\delta_n = \frac{K^2 \log \Delta^{\mathcal{C}}(X_1, \dots, X_n)}{n}.$$

Thus, with the same choice of $N$ we get for some numerical constant $c > 0$ the bound

$$\mathbb{P}\Big\{\hat r_N^n > c\Big(\frac{\log \Delta^{\mathcal{C}}(X_1, \dots, X_n)}{n} \vee \varepsilon\Big)\Big\} \le \big([\log_2 \log_2 \varepsilon^{-1}] + 1\big) e^{-n\varepsilon/2}.$$

Theorem 1 implies at the same time that for any consistent estimate $\hat C_n$ we have
$P(\hat C_n \triangle C_0) \le \hat r_N^n$ with probability at least $1 - 2N e^{-n\varepsilon/2}$. This shows that for a VC-class
of concepts $\mathcal{C}$ with VC-dimension $V(\mathcal{C})$ the local Rademacher norm $\hat r_N^n$ (which, according to
Theorem 1, is an upper bound on the risk of consistent concepts $\hat C_n$) is bounded from above by
the quantity $O(V(\mathcal{C}) \log n / n)$. Up to a logarithmic factor, this is the optimal (in a minimax
sense) convergence rate of the generalization error to 0 (see, e.g., [3]).
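The quantity $\Delta^{\mathcal{C}}(X_1, \dots, X_n)$ in Example 1 is computable from the data alone. A minimal sketch for a finite family of candidate concepts is given below; the function names, the choice of half-lines as concepts, and the default `K = 1.0` are illustrative assumptions, since the paper leaves the numerical constant $K$ unspecified.

```python
import math

def trace_count(points, concepts):
    """Number of distinct subsets cut out of the sample by the concepts.

    Each concept is a predicate x -> bool describing membership in C.
    """
    return len({tuple(c(x) for x in points) for c in concepts})

def delta_hat(points, concepts, K=1.0):
    """K^2 log(Delta) / n, with a placeholder constant K."""
    return K ** 2 * math.log(trace_count(points, concepts)) / len(points)

points = [0.1, 0.35, 0.6, 0.8]
# Half-lines (-inf, t]; the default argument freezes t for each predicate.
half_lines = [lambda x, t=t: x <= t for t in (0.0, 0.2, 0.5, 0.7, 0.9)]
count = trace_count(points, half_lines)
```

For half-lines on four distinct points the count is 5, that is $n + 1$, which grows polynomially in $n$ as expected for a class of VC-dimension 1.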

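Next we consider conditions in terms of entropy with bracketing
$H_{[\,]}(\mathcal{F}, \varepsilon) := \log N_{[\,]}(\mathcal{F}, \varepsilon)$. Here $N_{[\,]}(\mathcal{F}, \varepsilon)$ denotes the minimal number of
"brackets" $[f^-, f^+] := \{f : f^- \le f \le f^+\}$ with $d_{P,2}(f^-, f^+) \le \varepsilon$ needed to cover $\mathcal{F}$
($f^-, f^+$ being two measurable functions from $S$ into $[0, 1]$ such that $f^- \le f^+$). Let

$$\psi_{[\,]}(r) = \int_0^r \big(H_{[\,]}(\mathcal{F}, u) + 1\big)^{1/2}\, du,$$

and let $\delta_{[n]} = \delta_{[n]}(P)$ be the solution of the equation

$$\delta_{[n]} = n^{-1/2} \psi_{[\,]}\big(\sqrt{\delta_{[n]}}\big).$$

Again, we set for some $\varepsilon > 0$, $N := [\log_2 \log_2 \varepsilon^{-1}] + 1$. Then the following theorem holds.

Theorem 3 There exists a constant $c > 0$ such that for all $P \in \mathcal{P}(S)$

$$\mathbb{P}\big\{\hat r_N^n > c(\delta_{[n]}(P) \vee \varepsilon)\big\} \le \big([\log_2 \log_2 \varepsilon^{-1}] + 1\big) e^{-n\varepsilon/2}.$$

In particular, if $H_{[\,]}(\mathcal{F}; u) = O(u^{-\gamma})$, where $\gamma < 2$, then $\psi_{[\,]}(r) \asymp r^{1 - \gamma/2}$ and
$\delta_{[n]} \asymp n^{-2/(2+\gamma)}$.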
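The rate for polynomial bracketing entropy follows by solving the fixed-point equation explicitly. A short worked derivation, assuming for illustration that $\psi_{[\,]}(r) = A\, r^{1-\gamma/2}$ holds exactly with a hypothetical constant $A > 0$:

```latex
\delta = n^{-1/2} A\, \delta^{\frac{1}{2}\left(1 - \frac{\gamma}{2}\right)}
\;\Longleftrightarrow\;
\delta^{\frac{2+\gamma}{4}} = A\, n^{-1/2}
\;\Longleftrightarrow\;
\delta = A^{\frac{4}{2+\gamma}}\, n^{-\frac{2}{2+\gamma}} .
```

The exponent $1 - \tfrac12(1 - \gamma/2) = (2+\gamma)/4$ is positive for every $\gamma > 0$, so the solution is unique, and it decays like $n^{-2/(2+\gamma)}$.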

Example 2. Learning a concept from a d-dimensional cube. Let $S = [0, 1]^d$. We
consider a problem of estimation of a set (a concept) $C_0 \subset [0, 1]^d$, based on the observations
$(X_j, Y_j)$, $j = 1, \dots, n$, where $X_j$, $j = 1, \dots, n$, are i.i.d. points in $[0, 1]^d$ with common
distribution $P$ and $Y_j := I_{C_0}(X_j)$, $j = 1, \dots, n$. Such a model frequently occurs in
problems of edge estimation in image analysis (see Mammen and Tsybakov (1995)). Assume
that the distribution $P$ has a density $p$ such that for some $B > 0$

$$B^{-1} \le p(x) \le B, \quad x \in [0, 1]^d.$$

Let $\mathcal{C}$ be a class of Borel subsets in $[0, 1]^d$ such that $\mathcal{C} \ni C_0$. Let $\lambda$ be the Lebesgue measure
on $[0, 1]^d$. Denote by $N_I(\mathcal{C}; \varepsilon)$ the minimal number of brackets $[C^-, C^+] := \{C : C^- \subset C \subset C^+\}$
with $\lambda(C^+ \setminus C^-) \le \varepsilon$ ($C^-, C^+$ being two measurable subsets in $[0, 1]^d$ such that $C^- \subset C^+$).
Let $H_I(\mathcal{C}; \varepsilon) := \log N_I(\mathcal{C}; \varepsilon)$. This version of entropy with bracketing is often called "entropy
with inclusion". We define

$$\psi_I(r) = \int_0^r \big(H_I(\mathcal{C}, u) + 1\big)^{1/2}\, du,$$

and let $\delta_n^I = \delta_n^I(P)$ be the solution of the equation

[equation missing in the source]

If we have

$$H_I(\mathcal{C}; u) = O(u^{-\gamma}),$$

then Theorem 3 easily implies that with some constant $c > 0$

$$\mathbb{P}\big\{\hat r_N^n > c(\delta_n^I \vee \varepsilon)\big\} \le \big([\log_2 \log_2 \varepsilon^{-1}] + 1\big) e^{-n\varepsilon/2},$$

where $\delta_n^I \asymp n^{-1/(1+\gamma)}$. By Theorem 1, for any consistent estimate $\hat C_n$ of the set $C_0$ (i.e. such
that $I_{\hat C_n}(X_j) = Y_j$, $j = 1, \dots, n$), the quantity $\hat r_N^n$ is an upper bound (up to a constant) on
$\lambda(\hat C_n \triangle C_0)$.

In particular, if $\mathcal{C}$ is the class of sets with $\alpha$-smooth boundary in $[0, 1]^d$, then well
known bounds on the bracketing entropy due to Dudley (see e.g. Dudley (1999)) imply that
$\gamma = (d-1)/\alpha$ and $\delta_n^I = n^{-\alpha/(d-1+\alpha)}$. Similarly, if $\mathcal{C}$ is the class of closed convex subsets
of $[0, 1]^d$, the rate becomes $\delta_n^I = n^{-2/(d+1)}$. It was shown by Mammen and Tsybakov (1995)
that both rates are optimal in a minimax sense.
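The two rates quoted above are instances of one exponent formula. The sketch below assumes, as a reading of this example rather than a statement taken verbatim from it, that $H_I(\mathcal{C}; u) = O(u^{-\gamma})$ leads to $\delta_n^I \asymp n^{-1/(1+\gamma)}$, with $\gamma = (d-1)/\alpha$ for $\alpha$-smooth boundaries and $\gamma = (d-1)/2$ for convex sets; the function names are hypothetical.

```python
def rate_exponent(gamma):
    """Exponent a in delta ~ n^{-a}, assuming a = 1 / (1 + gamma)."""
    return 1.0 / (1.0 + gamma)

def smooth_boundary_exponent(d, alpha):
    """Sets with alpha-smooth boundary in [0, 1]^d: gamma = (d - 1) / alpha."""
    return rate_exponent((d - 1) / alpha)

def convex_sets_exponent(d):
    """Closed convex subsets of [0, 1]^d: gamma = (d - 1) / 2."""
    return rate_exponent((d - 1) / 2)
```

Under these assumptions the exponents simplify to $\alpha/(d-1+\alpha)$ and $2/(d+1)$ respectively.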

The examples above show that the local Rademacher penalties (defined only based on 
the data and using neither prior information about the underlying distribution, nor the 
specific properties of the function class) can recover the optimal convergence rates of the 
estimates in function learning problems. 

2 Proofs of the main results 

The proofs of the results are based on a version of Talagrand's concentration inequalities for
empirical processes, see [11], [12]. The version of the inequalities we are using, with explicit
numerical values of the constants involved (that determine the values of the constants in
our procedures, such as $K_1, K_2, K_3$ above), is due to Massart (1999). These inequalities are
also very convenient for applications since the quantity $\sigma^2$ (the sup-norm of the variances,
see below) they involve is very easy to bound. It should also be mentioned that the idea
to use Talagrand's concentration inequalities to bound the risk in nonparametric estimation
and, especially, in model selection problems goes back to Birge and Massart (see P and 
references therein). 

We now formulate Massart's inequality in a form convenient for our purposes.

Theorem 4 Let $\mathcal{F}$ be some countable family of real valued measurable functions, such that
$\|f\|_\infty \le b < \infty$ for every $f \in \mathcal{F}$. Let $Z$ denote either $\|P_n - P\|_{\mathcal{F}}$ or $\|R_n\|_{\mathcal{F}}$. Let
$\sigma^2 = n \sup_{f \in \mathcal{F}} \mathrm{Var}(f(X_1))$. Then for any positive real number $x$ and $0 < \gamma < 1$

$$\mathbb{P}\big(Z \ge (1+\gamma)\mathbb{E}Z + [\sigma\sqrt{2kx} + k(\gamma) b x]/n\big) \le e^{-x}, \tag{2.1}$$

where $k$ and $k(\gamma)$ can be taken equal to $k = 4$ and $k(\gamma) = 3.5 + 32\gamma^{-1}$. Moreover, one also
has

$$\mathbb{P}\big(Z \le (1-\gamma)\mathbb{E}Z - [\sigma\sqrt{2k'x} + k'(\gamma) b x]/n\big) \le e^{-x}, \tag{2.2}$$

where $k' = 5.4$ and $k'(\gamma) = 3.5 + 43.2\gamma^{-1}$.

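Proof of Theorem 1. For any fixed real positive number $r$ and $\gamma, \gamma' \in (0, 1)$, let

$$\varphi_1(r) := \|P_n - P\|_{\mathcal{F} \cap B(r)},$$

$$\varphi_2(r) := (1+\gamma)\,\mathbb{E}\|P_n - P\|_{\mathcal{F} \cap B(r)} + 2\sqrt{r\varepsilon} + (1.75 + 16\gamma^{-1})\varepsilon,$$

$$\varphi_3(r) := \frac{2(1+\gamma)}{1-\gamma'}\Big[\|R_n\|_{\mathcal{F} \cap B(r)} + \sqrt{5.4\, r\varepsilon} + (1.75 + 21.6(\gamma')^{-1})\varepsilon\Big] + 2\sqrt{r\varepsilon} + (1.75 + 16\gamma^{-1})\varepsilon.$$

Then, for any $r > 0$

$$\mathbb{P}\big(\varphi_1(r) \le \varphi_2(r) \le \varphi_3(r)\big) \ge 1 - 2e^{-n\varepsilon/2}. \tag{2.3}$$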

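Indeed, in order to apply inequalities (2.1) and (2.2), we notice that for every
$f \in \mathcal{F} \cap B(r)$ the sup-norm $\|f\|_\infty \le b = 1$ and

$$\sigma^2 = \sup_{\mathcal{F} \cap B(r)} n \mathrm{Var}(f(X)) \le \sup_{\mathcal{F} \cap B(r)} n P f^2 \le \sup_{\mathcal{F} \cap B(r)} n P f \le nr.$$

Moreover, if we set $x = n\varepsilon/2$, then (2.1) implies

$$\mathbb{P}\Big(\|P_n - P\|_{\mathcal{F} \cap B(r)} \ge (1+\gamma)\,\mathbb{E}\|P_n - P\|_{\mathcal{F} \cap B(r)} + 2\sqrt{r\varepsilon} + (1.75 + 16\gamma^{-1})\varepsilon\Big) \le e^{-n\varepsilon/2}$$

and (2.2) implies

$$\mathbb{P}\Big(\mathbb{E}\|R_n\|_{\mathcal{F} \cap B(r)} \ge (1-\gamma')^{-1}\big[\|R_n\|_{\mathcal{F} \cap B(r)} + \sqrt{5.4\, r\varepsilon} + (1.75 + 21.6(\gamma')^{-1})\varepsilon\big]\Big) \le e^{-n\varepsilon/2}.$$

Taking into account the symmetrization inequality

$$\mathbb{E}\|P_n - P\|_{\mathcal{F} \cap B(r)} \le 2\,\mathbb{E}\|R_n\|_{\mathcal{F} \cap B(r)},$$

we get (2.3).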
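The substitution $x = n\varepsilon/2$, $b = 1$, $\sigma^2 = nr$ can be checked numerically. The sketch below evaluates the deviation terms of (2.1) and (2.2) with the constants quoted in Theorem 4; the function names and the sample values of $n$, $r$, $\varepsilon$, $\gamma$ are hypothetical.

```python
import math

def upper_deviation(sigma2, b, x, n, gamma, k=4.0):
    """[sigma sqrt(2 k x) + k(gamma) b x] / n with k(gamma) = 3.5 + 32 / gamma."""
    k_gamma = 3.5 + 32.0 / gamma
    return (math.sqrt(sigma2 * 2 * k * x) + k_gamma * b * x) / n

def lower_deviation(sigma2, b, x, n, gamma, k=5.4):
    """[sigma sqrt(2 k' x) + k'(gamma) b x] / n with k'(gamma) = 3.5 + 43.2 / gamma."""
    k_gamma = 3.5 + 43.2 / gamma
    return (math.sqrt(sigma2 * 2 * k * x) + k_gamma * b * x) / n

n, r, eps, gamma = 1000, 0.1, 0.02, 0.5
up = upper_deviation(n * r, 1.0, n * eps / 2, n, gamma)
low = lower_deviation(n * r, 1.0, n * eps / 2, n, gamma)
```

With these inputs `up` agrees with $2\sqrt{r\varepsilon} + (1.75 + 16\gamma^{-1})\varepsilon$ and `low` with $\sqrt{5.4\, r\varepsilon} + (1.75 + 21.6\gamma^{-1})\varepsilon$, the terms appearing in $\varphi_2$ and $\varphi_3$.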
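We set

$$K_1 := \frac{2(1+\gamma)}{1-\gamma'}, \quad K_2 := \frac{2\sqrt{5.4}\,(1+\gamma)}{1-\gamma'} + 2,$$

$$K_3 := \frac{2(1+\gamma)}{1-\gamma'}\big(1.75 + 21.6(\gamma')^{-1}\big) + \big(1.75 + 16\gamma^{-1}\big),$$

so that $\varphi_3(r) = K_1 \|R_n\|_{\mathcal{F} \cap B(r)} + K_2\sqrt{r\varepsilon} + K_3\varepsilon$.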
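To get a feeling for the size of these constants, they can be evaluated for concrete parameters. The sketch assumes the grouping of terms of $\varphi_3$ into a coefficient of the Rademacher norm, a coefficient of $\sqrt{r\varepsilon}$ and a coefficient of $\varepsilon$; the function name and the values $\gamma = \gamma' = 1/2$ are illustrative choices, not recommendations from the paper.

```python
import math

def rademacher_constants(gamma, gamma_prime):
    """Coefficients of ||R_n||, sqrt(r eps) and eps in phi_3, for gamma, gamma' in (0, 1)."""
    k1 = 2 * (1 + gamma) / (1 - gamma_prime)
    k2 = k1 * math.sqrt(5.4) + 2
    k3 = k1 * (1.75 + 21.6 / gamma_prime) + (1.75 + 16 / gamma)
    return k1, k2, k3

k1, k2, k3 = rademacher_constants(0.5, 0.5)
```

For $\gamma = \gamma' = 1/2$ this gives $K_1 = 6$ and $K_3 \approx 303.45$, so the guaranteed constants are conservative.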

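Let us introduce the following sequence: $\bar r_0^n := 1$ and $\bar r_{k+1}^n = \varphi_2(\bar r_k^n) \wedge 1$ for $k = 0, 1, 2, \dots$.
Since $\varphi_2$ is nondecreasing, it is easy to prove by induction that the sequence $\{\bar r_k^n\}$ is
nonincreasing. Note that this sequence is not random.

We will also prove by induction that for all $k \ge 0$

$$\mathbb{P}\{r_i^n \le \bar r_i^n \le \hat r_i^n,\ i \le k\} \ge 1 - 2k e^{-n\varepsilon/2}. \tag{2.4}$$

For $k = 0$, (2.4) is trivial since $r_0^n = \bar r_0^n = \hat r_0^n = 1$. We proceed by induction.
Let us introduce the events

$$A_k = \{r_i^n \le \bar r_i^n \le \hat r_i^n,\ i \le k\} \quad \text{and} \quad B_k = \{\varphi_1(\bar r_k^n) \le \varphi_2(\bar r_k^n) \le \varphi_3(\bar r_k^n)\}.$$

To make the induction step, let us assume that we have already proven that

$$\mathbb{P}(A_k) \ge 1 - 2k e^{-n\varepsilon/2}.$$

Then (2.3) implies

$$\mathbb{P}(B_k) \ge 1 - 2e^{-n\varepsilon/2}.$$

On the event $A_k \cap B_k$,

$$\mathcal{F} \cap B(\bar r_k^n) \subset \mathcal{F} \cap B^e(2\hat r_k^n),$$

since for $f \in \mathcal{F} \cap B(\bar r_k^n)$

$$P_n f \le P f + \|P_n - P\|_{\mathcal{F} \cap B(\bar r_k^n)} \le \bar r_k^n + \|P_n - P\|_{\mathcal{F} \cap B(\bar r_k^n)} = \bar r_k^n + \varphi_1(\bar r_k^n) \le \bar r_k^n + \varphi_2(\bar r_k^n) \wedge 1 = \bar r_k^n + \bar r_{k+1}^n \le 2\bar r_k^n \le 2\hat r_k^n$$

(here we used that $\varphi_1 \le 1$), which implies that the inequalities
$\varphi_2(\bar r_k^n) \le \varphi_3(\bar r_k^n) \le \hat\varphi(\hat r_k^n)$ hold. Therefore, on the event $A_k \cap B_k$,

$$r_{k+1}^n = \varphi_1(r_k^n) \le \varphi_1(\bar r_k^n) \le \varphi_2(\bar r_k^n) \wedge 1 = \bar r_{k+1}^n \le \varphi_3(\bar r_k^n) \wedge 1 \le \hat\varphi(\hat r_k^n) \wedge 1 = \hat r_{k+1}^n.$$

So, $A_k \cap B_k \subset A_{k+1}$, which completes the proof of the induction step:

$$\mathbb{P}(A_{k+1}) \ge 1 - 2(k+1) e^{-n\varepsilon/2}.$$

It follows that

$$\mathbb{P}(r_N^n > \hat r_N^n) \le 2N e^{-n\varepsilon/2},$$

and since, by Proposition 1, $P \hat f_n \le r_N^n$, we conclude that

$$\mathbb{P}\{P \hat f_n > \hat r_N^n\} \le 2N e^{-n\varepsilon/2}.$$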

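Proof of Theorem 2. Let $(\Omega_\varepsilon, \Sigma_\varepsilon, \mathbb{P}_\varepsilon)$ denote the probability space on which the
Rademacher sequence $\varepsilon_1, \dots, \varepsilon_n, \dots$ is defined, $\mathbb{E}_\varepsilon$ being the expectation with respect to
$\mathbb{P}_\varepsilon$. We introduce the function

$$\varphi_4(r) := \frac{2(1+\gamma)}{1-\gamma'}\Big[(1+\gamma'')\,\mathbb{E}_\varepsilon\|R_n\|_{\mathcal{F} \cap B^e(2r)} + 2\sqrt{2r\varepsilon} + (1.75 + 16(\gamma'')^{-1})\varepsilon + \sqrt{5.4\, r\varepsilon} + (1.75 + 21.6(\gamma')^{-1})\varepsilon\Big] + 2\sqrt{r\varepsilon} + (1.75 + 16\gamma^{-1})\varepsilon, \tag{2.5}$$

where $\gamma'' > 0$. The inequalities (2.1) and (2.2) also hold for the conditional probability
and the process $Z = R_n$ with fixed $X_1, \dots, X_n$. Therefore, for any $r > 0$

$$\mathbb{P}_\varepsilon\big(\hat\varphi(r) \le \varphi_4(r)\big) \ge 1 - e^{-n\varepsilon/2}.$$

Define a sequence

$$\tilde r_0^n = 1, \quad \tilde r_{k+1}^n = \varphi_4(\tilde r_k^n) \wedge 1, \quad k = 0, 1, 2, \dots$$

By an induction argument similar to the one we used in the proof of Theorem 1, we get

$$\mathbb{P}_\varepsilon\Big(\bigcap_{k=1}^{N} \{\hat r_k^n \le \tilde r_k^n\}\Big) \ge 1 - N e^{-n\varepsilon/2}.$$

If we prove that $\tilde r_k^n \le \bar r_k$ for a sequence $\bar r_k$, independent of $\varepsilon_1, \dots, \varepsilon_n$, then the
unconditional probability satisfies the same bound:

$$\mathbb{P}\Big(\bigcap_{k=1}^{N} \{\hat r_k^n \le \bar r_k\}\Big) \ge 1 - N e^{-n\varepsilon/2}.$$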

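By the assumption we have

$$\mathbb{E}_\varepsilon \Big\| n^{-1/2} \sum_{i=1}^{n} \varepsilon_i \delta_{X_i} \Big\|_{B^e(r) \cap \mathcal{F}} \le \hat\psi_n(\sqrt{r}). \tag{2.6}$$

Hence, we can choose $c > 1$, depending on the parameters $\gamma, \gamma', \gamma''$ in the definition (2.5)
of the function $\varphi_4$, in such a way that

$$\tilde r_{k+1}^n = \varphi_4(\tilde r_k^n) \wedge 1 \le c\Big(\varepsilon + (\tilde r_k^n \varepsilon)^{1/2} + n^{-1/2}\hat\psi_n\big(\sqrt{\tilde r_k^n}\big)\Big).$$

The above inequality implies by induction that the sequence

$$\bar r_0 = 1, \quad \bar r_{k+1} = c\Big(\varepsilon + (\bar r_k \varepsilon)^{1/2} + n^{-1/2}\hat\psi_n\big(\sqrt{\bar r_k}\big)\Big) \wedge 1,$$

majorizes the sequence $\tilde r_k^n$.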

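It is clear that in the case when $\bar r_1 < 1$ the sequence is decreasing and it converges to
the solution $\bar\delta$ of the equation

$$\bar\delta = c\Big(\varepsilon + (\bar\delta\varepsilon)^{1/2} + n^{-1/2}\hat\psi_n\big(\sqrt{\bar\delta}\big)\Big).$$

Let us study the behaviour of the difference $d_k := \bar r_k - \bar\delta$. Since the function $\hat\psi_n$ is concave,
we have

$$\hat\psi_n'\big(\sqrt{\bar\delta}\big) \le \hat\psi_n\big(\sqrt{\bar\delta}\big)/\sqrt{\bar\delta}.$$

The definition of $\bar\delta$ implies that

$$c\Big(n^{-1/2}\hat\psi_n\big(\sqrt{\bar\delta}\big) + \sqrt{\bar\delta\varepsilon}\Big) \le \bar\delta.$$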



V. Koltchinskii and D. Panchenko 






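Therefore

$$d_{k+1} = \bar r_{k+1} - \bar\delta = c\Big(n^{-1/2}\hat\psi_n\big(\sqrt{\bar r_k}\big) - n^{-1/2}\hat\psi_n\big(\sqrt{\bar\delta}\big) + \sqrt{\bar r_k \varepsilon} - \sqrt{\bar\delta\varepsilon}\Big)$$

$$\le c\Big(n^{-1/2}\hat\psi_n'\big(\sqrt{\bar\delta}\big) + \sqrt{\varepsilon}\Big)\big(\sqrt{\bar r_k} - \sqrt{\bar\delta}\big) \le c\Big(n^{-1/2}\hat\psi_n\big(\sqrt{\bar\delta}\big) + \sqrt{\bar\delta\varepsilon}\Big)\Big/\sqrt{\bar\delta}\ \cdot \sqrt{d_k} \le \sqrt{\bar\delta d_k},$$

where we used $\sqrt{\bar r_k} - \sqrt{\bar\delta} \le \sqrt{\bar r_k - \bar\delta} = \sqrt{d_k}$.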

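We have proven that the sequence $d_k$ satisfies the following inequality

$$d_{k+1} \le \sqrt{\bar\delta d_k}, \quad k \ge 0.$$

Now it is easy to show by induction that

$$d_N \le \bar\delta^{\,2^{-1} + \dots + 2^{-N}} = \bar\delta^{\,1 - 2^{-N}}.$$

Going back to the sequence $\bar r_k$, we get that

$$\bar r_N = \bar\delta + d_N \le \bar\delta\big(1 + \bar\delta^{\,-2^{-N}}\big).$$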

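Since the definition of $\bar\delta$ implies that $\bar\delta^{-1} \le \varepsilon^{-1}$, the choice of

$$N = [\log_2 \log_2 \varepsilon^{-1}] + 1$$

guarantees that $\bar\delta^{\,-2^{-N}} \le 2$ and, hence, $\bar r_N \le (1 + 2)\bar\delta = 3\bar\delta$. What remains to do in order to
finish the proof of the theorem is to bound $\bar\delta$ by the maximum of $\varepsilon$ and the solution $\hat\delta_n$ of
the equation $\hat\delta_n = n^{-1/2}\hat\psi_n\big(\sqrt{\hat\delta_n}\big)$. Actually, we will prove that $\bar\delta$ is bounded by
$\delta'' := (3c)^2\delta'$, where $\delta' = (\hat\delta_n \vee \varepsilon)$. First of all, let us notice that the fact that $\hat\psi_n$ is
concave and $\hat\psi_n(0) = 0$ implies that for $c \ge 1$, $\hat\psi_n(cx) \le c\hat\psi_n(x)$. Also note that, since
$\delta' \ge \hat\delta_n$, the concavity of $\hat\psi_n$ and the definition of $\hat\delta_n$ imply

$$n^{-1/2}\hat\psi_n\big(\sqrt{\delta'}\big) \le n^{-1/2}\,\frac{\hat\psi_n\big(\sqrt{\hat\delta_n}\big)}{\sqrt{\hat\delta_n}}\,\sqrt{\delta'} = \sqrt{\hat\delta_n}\sqrt{\delta'} \le \delta'.$$

Combining these properties, we get

$$c\Big(\varepsilon + (9c^2\delta'\varepsilon)^{1/2} + n^{-1/2}\hat\psi_n\big(3c\sqrt{\delta'}\big)\Big) \le c\big(2 \cdot 3c\,\delta' + \delta'\big) \le 9c^2\delta' = \delta''.$$

This necessarily means that $\bar\delta \le \delta'' = 9c^2(\hat\delta_n \vee \varepsilon)$. And, hence,
$\hat r_N^n \le 3\bar\delta \le 3\delta'' = 27c^2(\hat\delta_n \vee \varepsilon)$ on the event where $\hat r_N^n \le \bar r_N$.
The theorem is proven.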
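The contraction step of this proof can be checked numerically in its worst case. The sketch below iterates the equality case $d_{k+1} = \sqrt{\delta d_k}$ from $d_0 = 1 - \delta$; the function name and the value $\delta = \varepsilon = 0.01$ are hypothetical, and $[\cdot]$ is again read as the integer part.

```python
import math

def worst_case_gap(delta, num_iter):
    """Iterate the equality case d_{k+1} = sqrt(delta * d_k) from d_0 = 1 - delta."""
    d = 1.0 - delta
    for _ in range(num_iter):
        d = math.sqrt(delta * d)
    return d

eps = 0.01
delta = eps  # smallest value compatible with delta >= eps
num_iter = math.floor(math.log2(math.log2(1.0 / eps))) + 1
gap = worst_case_gap(delta, num_iter)
```

Here `num_iter` equals 3 and the resulting gap stays below both $\delta^{1 - 2^{-N}}$ and $2\delta$, so three iterations already bring the majorizing sequence within a factor 3 of its limit.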

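Proof of Theorem 3. In order to bound $\hat r_k^n$, we first construct a bound on $\|R_n\|_{\mathcal{F} \cap B^e(2\hat r_k^n)}$
in terms of $\mathbb{E}\|P_n - P\|_{\mathcal{F} \cap B(3\bar r_k)}$ for a properly defined sequence $\bar r_k$. Afterwards, the
expectation can be majorized by the bracketing entropy integral. We will show that the
sequence $\bar r_k$ can be chosen as follows

$$\bar r_0 = 1, \quad \bar r_{k+1} = \Big(c_1\,\mathbb{E}\|P_n - P\|_{\mathcal{F} \cap B(3\bar r_k)} + c_2\sqrt{3\varepsilon\bar r_k} + c_3\varepsilon\Big) \wedge 1,$$

for some large enough constants $c_1, c_2, c_3 > 0$. One can argue similarly to the proof of
Theorem 2 to show that the following bound holds:

$$\mathbb{P}\Big(\bigcap_{k \le i} \{\hat r_k^n \le \bar r_k\}\Big) \ge 1 - 2i e^{-n\varepsilon/2}. \tag{2.7}$$

We will prove an even stronger assertion: for the event

$$A_i := \bigcap_{k \le i} \Big\{\hat r_k^n \le \bar r_k,\ \mathcal{F} \cap B^e(2\hat r_k^n) \subset \mathcal{F} \cap B(3\bar r_k)\Big\}$$

we have

$$\mathbb{P}(A_i) \ge 1 - 2i e^{-n\varepsilon/2}. \tag{2.8}$$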

Let us choose the constants $c_1', c_2', c_3' > 0$ and $c_1, c_2, c_3 > 0$ in such a way that for the
functions

$$\varphi_5(r) = c_1'\|P_n - P\|_{\mathcal{F} \cap B(r)} + c_2'\sqrt{\varepsilon r} + c_3'\varepsilon$$

and

$$\varphi_6(r) = c_1\,\mathbb{E}\|P_n - P\|_{\mathcal{F} \cap B(r)} + c_2\sqrt{\varepsilon r} + c_3\varepsilon,$$

the inequalities of Massart (see Theorem 4) imply that for any fixed $r > 0$

$$\varphi_3(r) \le \varphi_5(r) \le \varphi_6(r)$$

with probability at least $1 - 2e^{-n\varepsilon/2}$ (the function $\varphi_3$ was defined in the proof of Theorem 1).
Clearly, we have $\bar r_{k+1} = \varphi_6(3\bar r_k) \wedge 1$.

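First observe that (2.8) holds for $i = 0$ (since $\hat r_0^n = \bar r_0 = 1$). Define

$$B_i := \{\varphi_3(3\bar r_i) \le \varphi_5(3\bar r_i) \le \varphi_6(3\bar r_i)\}.$$

Then

$$\mathbb{P}(B_i) \ge 1 - 2e^{-n\varepsilon/2}.$$

To make an induction step, we first of all notice that on the event $A_i \cap B_i$, we have

$$\hat r_{i+1}^n = \hat\varphi(\hat r_i^n) \wedge 1 \le \varphi_3(3\bar r_i) \wedge 1 \le \varphi_5(3\bar r_i) \wedge 1 \le \varphi_6(3\bar r_i) \wedge 1 = \bar r_{i+1}.$$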

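Also, on the event $A_i \cap B_i$, we have $\mathcal{F} \cap B^e(2\hat r_{i+1}^n) \subset \mathcal{F} \cap B(3\bar r_{i+1})$. Indeed, if
$f \in \mathcal{F} \cap B^e(2\hat r_{i+1}^n)$, then

$$P f \le 2\hat r_{i+1}^n + \|P_n - P\|_{\mathcal{F} \cap B^e(2\hat r_{i+1}^n)} \le 2\hat r_{i+1}^n + \|P_n - P\|_{\mathcal{F} \cap B^e(2\hat r_i^n)}$$

$$\le 2\hat r_{i+1}^n + \|P_n - P\|_{\mathcal{F} \cap B(3\bar r_i)} \le 2\hat r_{i+1}^n + \varphi_5(3\bar r_i) \wedge 1$$

$$\le 2\hat r_{i+1}^n + \varphi_6(3\bar r_i) \wedge 1 = 2\hat r_{i+1}^n + \bar r_{i+1} \le 3\bar r_{i+1}$$

(to show that $\|P_n - P\|_{\mathcal{F} \cap B(3\bar r_i)} \le \varphi_5(3\bar r_i) \wedge 1$ we used the fact that the constant $c_1'$ in the
definition of $\varphi_5$ is larger than 1). Thus, $A_i \cap B_i \subset A_{i+1}$ and

$$\mathbb{P}(A_{i+1}) \ge 1 - 2(i+1) e^{-n\varepsilon/2}.$$

The proof of the induction step and of the bounds (2.8) and (2.7) is complete.

To finish the proof of the theorem one has to bound $\mathbb{E}\|P_n - P\|_{\mathcal{F} \cap B(r)}$. Since for all
$g \in \mathcal{F} \cap B(r)$ we have $\|g\|_{P,2} \le (Pg)^{1/2} \le \sqrt{r}$ and $|g| \le 1$, then by Theorem 2.14.2 of
van der Vaart and Wellner

$$\mathbb{E}\|P_n - P\|_{\mathcal{F} \cap B(r)} \le c\Big(n^{-1/2}\psi_{[\,]}(\sqrt{r}) + I\{1 > \sqrt{n}\,a(\sqrt{r})\}\Big),$$

where

$$a(u) := \frac{u}{\big(1 + H_{[\,]}(\mathcal{F}, u)\big)^{1/2}}.$$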
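We can assume that $\bar r_N \ge \delta_{[n]}$; otherwise, bound (2.7) immediately implies the assertion of
the theorem. Therefore, $\bar r_k \ge \delta_{[n]}$ for all $k \le N$, which implies that $1 \le \sqrt{n}\,a(\sqrt{3\bar r_k})$. Indeed,
using concavity of $\psi_{[\,]}$ and the definition of $\delta_{[n]}$, we have

$$\frac{\psi_{[\,]}(\sqrt{3\bar r_k})}{\sqrt{3\bar r_k}} \le \frac{\psi_{[\,]}(\sqrt{\delta_{[n]}})}{\sqrt{\delta_{[n]}}} = \sqrt{n}\sqrt{\delta_{[n]}} \le \sqrt{n}\sqrt{3\bar r_k},$$

which implies

$$3\bar r_k \ge n^{-1/2}\psi_{[\,]}(\sqrt{3\bar r_k}) \ge n^{-1/2}(3\bar r_k)^{1/2}\big(1 + H_{[\,]}(\mathcal{F}, \sqrt{3\bar r_k})\big)^{1/2}.$$

Hence, $1 \le \sqrt{n}\,a(\sqrt{3\bar r_k})$ and

$$\mathbb{E}\|P_n - P\|_{\mathcal{F} \cap B(3\bar r_k)} \le c\, n^{-1/2}\psi_{[\,]}(\sqrt{3\bar r_k}).$$

Finally, with some constant $c > 0$

$$\bar r_{k+1} \le c\Big(n^{-1/2}\psi_{[\,]}(\sqrt{3\bar r_k}) + \varepsilon + \sqrt{\varepsilon\bar r_k}\Big).$$

The proof can be completed by the argument we used in the proof of Theorem 2.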